Course: Machine Learning and Data Science for Social Good (20S856137)
Authors: Boqin Cai (boqin.cai@stud.sbg.ac.at)
Credit scorecards are a common risk-control method in the financial industry. They use the personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings, so the bank can decide whether to issue a credit card to the applicant. Credit scores objectively quantify the magnitude of risk.
In this notebook, I use Python to analyze the risk of credit card customers based on historical data. I build models with a decision tree and a random forest, and finally use grid search cross-validation to optimize the random forest's parameters.
Import some basic modules and configure the figure format in the notebook.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling
%config InlineBackend.figure_format = 'retina'
application_record_df = pd.read_csv('426827_1031720_bundle_archive/application_record.csv')
credit_record_df = pd.read_csv('426827_1031720_bundle_archive/credit_record.csv')
application_record_df
credit_record_df
print(f"The shape of application_record_df {application_record_df.shape}")
print(f"The shape of credit_record_df {credit_record_df.shape}")
print(f"The number of distinct IDs of application_record_df {len(set(application_record_df['ID']))}")
print(f"The number of distinct IDs of credit_record_df {len(set(credit_record_df['ID']))}")
print('Missing value of application_record_df')
print(application_record_df.isna().any())
print('Missing value of credit_record_df')
print(credit_record_df.isna().any())
a = application_record_df['ID']
b = credit_record_df['ID']
import collections
a = collections.Counter(a)
for i in a:
    if a[i] > 1:
        print(i)
This is a sample of a repeated ID. The ID field should be unique, but for ID 7052783 there are 2 different rows in the dataset, so we can't be sure which one is correct. This could cause problems when merging the data. There are about 30 repeated IDs in the dataset, so we simply drop them all.
application_record_df[application_record_df['ID']==7052783]
application_record_df=application_record_df.drop_duplicates(subset='ID', keep=False)
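As a quick sanity check on the deduplication, here is what `keep=False` does on a tiny hypothetical frame: it drops every row whose ID is duplicated, rather than keeping one copy.

```python
import pandas as pd

# Hypothetical data: ID 2 appears twice with conflicting attributes
toy = pd.DataFrame({'ID': [1, 2, 2, 3], 'AGE': [30, 40, 41, 50]})

# keep=False drops all copies of a duplicated ID, not just the extras
deduped = toy.drop_duplicates(subset='ID', keep=False)
print(deduped['ID'].tolist())  # → [1, 3]
```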
If a customer has no loan or has paid it off, he/she is marked with 'X' or 'C', so I convert those values to -1. The strategy for defining a risky customer is: anyone who was ever more than 30 days overdue on a bill is marked. So, from the table credit_record, I analyze all records of each customer and mark normal customers as 0 and risky customers as 1.
# Replace only the STATUS column (indexing the whole frame would overwrite every column of the matching rows)
credit_record_df.loc[credit_record_df['STATUS']=='X', 'STATUS']=-1
credit_record_df.loc[credit_record_df['STATUS']=='C', 'STATUS']=-1
credit_record_df=credit_record_df.astype(int)
sns.countplot(credit_record_df.STATUS)
grouped=credit_record_df.groupby('ID').max()
grouped.loc[grouped['STATUS'] > 0, 'STATUS']=1
sns.countplot(grouped['STATUS'])
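The two steps above (worst status per customer, then thresholding at more than 30 days overdue) can be verified on a tiny hypothetical example:

```python
import pandas as pd

# Hypothetical credit records; STATUS > 0 means more than 30 days overdue
toy = pd.DataFrame({'ID':     [1, 1, 1, 2, 2],
                    'STATUS': [-1, 0, 2, 0, 0]})

worst = toy.groupby('ID').max()                # worst status per customer
worst.loc[worst['STATUS'] > 0, 'STATUS'] = 1  # 1 = risky, 0 = normal
print(worst['STATUS'].to_dict())  # → {1: 1, 2: 0}
```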
Link the two datasets with ID.
df=application_record_df.merge(grouped, how='inner', left_on='ID', right_on='ID')
df
report=pandas_profiling.ProfileReport(df)
report
CODE_GENDER, FLAG_OWN_CAR and FLAG_OWN_REALTY use strings to represent the specific meaning of the fields. The machine learning library sklearn cannot work with string labels directly, so for those 3 fields we use 0 and 1 instead.
CNT_CHILDREN, AMT_INCOME_TOTAL, DAYS_BIRTH, DAYS_EMPLOYED and CNT_FAM_MEMBERS are continuous variables. Here, they are all converted to categories.
NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, NAME_FAMILY_STATUS, NAME_HOUSING_TYPE and OCCUPATION_TYPE are multi-label columns, so I use LabelEncoder to encode those labels. OCCUPATION_TYPE also has missing values, which I replace with 'Other'.
The column FLAG_MOBIL only contains the value 1, so it is a constant and will be removed.
Output the types of the variables.
df.dtypes
CODE_GENDER, FLAG_OWN_CAR and FLAG_OWN_REALTY use strings to represent the specific meaning of the fields. The machine learning library sklearn cannot work with string labels directly, so for those 3 fields we use 0 and 1 instead.
from sklearn.preprocessing import LabelEncoder
CODE_GENDER_le = LabelEncoder()
df['CODE_GENDER'] = CODE_GENDER_le.fit_transform(df['CODE_GENDER'])
FLAG_OWN_CAR_le = LabelEncoder()
df['FLAG_OWN_CAR'] = FLAG_OWN_CAR_le.fit_transform(df['FLAG_OWN_CAR'])
FLAG_OWN_REALTY_le = LabelEncoder()
df['FLAG_OWN_REALTY'] = FLAG_OWN_REALTY_le.fit_transform(df['FLAG_OWN_REALTY'])
CNT_CHILDREN, AMT_INCOME_TOTAL, DAYS_BIRTH, DAYS_EMPLOYED and CNT_FAM_MEMBERS are continuous variables. Here, they are all converted to categories.
df.loc[df['CNT_CHILDREN'] >= 2, 'CNT_CHILDREN']=2
df.loc[df['AMT_INCOME_TOTAL'] <= 200000, 'AMT_INCOME_TOTAL']=0
df.loc[(df['AMT_INCOME_TOTAL'] <=400000) & (df['AMT_INCOME_TOTAL'] > 200000), 'AMT_INCOME_TOTAL']=1
df.loc[df['AMT_INCOME_TOTAL'] > 400000, 'AMT_INCOME_TOTAL']=2
sns.countplot(df['AMT_INCOME_TOTAL'])
df.loc[-df['DAYS_BIRTH'] <= 30*365, 'DAYS_BIRTH']=0
df.loc[(-df['DAYS_BIRTH'] > 30*365) & (-df['DAYS_BIRTH'] <= 40*365), 'DAYS_BIRTH']=1
df.loc[(-df['DAYS_BIRTH'] > 40*365) & (-df['DAYS_BIRTH'] <= 50*365), 'DAYS_BIRTH']=2
df.loc[(-df['DAYS_BIRTH'] > 50*365), 'DAYS_BIRTH']=3
# Positive DAYS_EMPLOYED values are placeholders (customer not currently employed);
# replace them with the mean of the real (negative) values
df.loc[df['DAYS_EMPLOYED'] > 0, 'DAYS_EMPLOYED']=df[df['DAYS_EMPLOYED'] < 0]['DAYS_EMPLOYED'].mean()
df.DAYS_EMPLOYED.hist()
df.loc[-df['DAYS_EMPLOYED'] <= 5*365, 'DAYS_EMPLOYED']=0
df.loc[(-df['DAYS_EMPLOYED'] > 5*365) & (-df['DAYS_EMPLOYED'] <= 10*365), 'DAYS_EMPLOYED']=1
df.loc[-df['DAYS_EMPLOYED'] > 10*365, 'DAYS_EMPLOYED']=2
sns.countplot(df['DAYS_EMPLOYED'])
df.loc[df['CNT_FAM_MEMBERS'] >= 3, 'CNT_FAM_MEMBERS'] = 3
sns.countplot(df['CNT_FAM_MEMBERS'])
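The chained `.loc` assignments above work, but the bin boundaries are easy to get wrong; `pd.cut` expresses the same kind of binning more directly. A sketch of the age bins on hypothetical DAYS_BIRTH values:

```python
import pandas as pd

# Hypothetical DAYS_BIRTH values (negative = days lived before today)
days_birth = pd.Series([-25 * 365, -35 * 365, -45 * 365, -60 * 365])

# Same bins as the age binning above: <=30y -> 0, 30-40y -> 1, 40-50y -> 2, >50y -> 3
age_bin = pd.cut(-days_birth,
                 bins=[0, 30 * 365, 40 * 365, 50 * 365, float('inf')],
                 labels=[0, 1, 2, 3])
print(age_bin.tolist())  # → [0, 1, 2, 3]
```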
NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, NAME_FAMILY_STATUS, NAME_HOUSING_TYPE and OCCUPATION_TYPE are multi-label columns, so I use LabelEncoder to encode those labels. OCCUPATION_TYPE also has missing values, which I replace with 'Other'.
NAME_INCOME_TYPE_le = LabelEncoder()
df['NAME_INCOME_TYPE'] = NAME_INCOME_TYPE_le.fit_transform(df['NAME_INCOME_TYPE'])
NAME_EDUCATION_TYPE_le = LabelEncoder()
df['NAME_EDUCATION_TYPE'] = NAME_EDUCATION_TYPE_le.fit_transform(df['NAME_EDUCATION_TYPE'])
NAME_FAMILY_STATUS_le = LabelEncoder()
df['NAME_FAMILY_STATUS'] = NAME_FAMILY_STATUS_le.fit_transform(df['NAME_FAMILY_STATUS'])
NAME_HOUSING_TYPE_le = LabelEncoder()
df['NAME_HOUSING_TYPE'] = NAME_HOUSING_TYPE_le.fit_transform(df['NAME_HOUSING_TYPE'])
df['OCCUPATION_TYPE']=df['OCCUPATION_TYPE'].fillna('Other')
OCCUPATION_TYPE_le = LabelEncoder()
df['OCCUPATION_TYPE'] = OCCUPATION_TYPE_le.fit_transform(df['OCCUPATION_TYPE'])
The column FLAG_MOBIL only contains the value 1, so it is a constant and will be removed.
df=df.drop(labels=['FLAG_MOBIL', 'MONTHS_BALANCE'],axis=1)
df
Usually, we need to split the dataset into 2 parts for model testing. Empirically, 70% of the data is used for training and 30% for testing. STATUS is the dependent variable, and there are 16 independent variables.
from sklearn.model_selection import train_test_split
X=df.iloc[:,1:-1]
y=df['STATUS']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=100)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
model=DecisionTreeClassifier(max_depth=20, min_samples_leaf=30)
model.fit(X_train,y_train)
y_predict_train=model.predict(X_train)
y_predict=model.predict(X_test)
print('Accuracy of training dataset: ', accuracy_score(y_train, y_predict_train))
print('Accuracy of test dataset: ', accuracy_score(y_test, y_predict))
print('Confusion matrix of training dataset: \n', confusion_matrix(y_train,y_predict_train))
print('Confusion matrix of test dataset: \n', confusion_matrix(y_test,y_predict))
ax=sns.heatmap(confusion_matrix(y_train,y_predict_train),cmap=plt.cm.Blues,annot=True)
ax.set_title('Confusion matrix of training dataset')
plt.show()
ax=sns.heatmap(confusion_matrix(y_test,y_predict),cmap=plt.cm.Blues,annot=True)
ax.set_title('Confusion matrix of test dataset')
plt.show()
If we draw the decision tree, we find something strange. OCCUPATION_TYPE indicates the type of job. The original data is categorical, but the machine learning models in sklearn only accept numbers as input, so we encoded it. The numbers, however, have no specific meaning and are not comparable. The decision tree model nevertheless treats them as ordinary numbers and splits nodes on numeric ranges.
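A small illustration of the problem, on a few hypothetical job titles: LabelEncoder sorts the labels alphabetically, so the integer codes reflect spelling, not any real order.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Manager', 'Cook', 'Driver', 'Cook'])

# Classes are sorted alphabetically before being assigned integers
print(le.classes_.tolist())  # → ['Cook', 'Driver', 'Manager']
print(codes.tolist())        # → [2, 0, 1, 0]
```

A tree split such as `OCCUPATION_TYPE <= 1.5` would therefore group Cook and Driver against Manager purely by alphabetical accident.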
from sklearn import tree
import graphviz
dot_tree=tree.export_graphviz(model, feature_names = X_train.columns, max_depth=2, filled=True)
graph = graphviz.Source(dot_tree)
graph
To solve this problem, I use one-hot encoding to re-encode the dataset. One-hot encoding separates the labels of a column into several columns containing only 0 and 1.
dummy_columns=['CNT_CHILDREN', 'AMT_INCOME_TOTAL','NAME_INCOME_TYPE',
'NAME_EDUCATION_TYPE','NAME_FAMILY_STATUS','NAME_HOUSING_TYPE',
'DAYS_BIRTH','DAYS_EMPLOYED','OCCUPATION_TYPE','CNT_FAM_MEMBERS']
dummy_X=pd.get_dummies(X, columns=dummy_columns)
dummy_X
X_train, X_test, y_train, y_test = train_test_split(dummy_X, y, stratify=y, test_size=0.3, random_state=100)
model=DecisionTreeClassifier(max_depth=20, min_samples_leaf=30)
model.fit(X_train,y_train)
y_predict_train=model.predict(X_train)
y_predict=model.predict(X_test)
print('Accuracy of training dataset: ', accuracy_score(y_train, y_predict_train))
print('Accuracy of test dataset: ', accuracy_score(y_test, y_predict))
print('Confusion matrix of training dataset: \n', confusion_matrix(y_train,y_predict_train))
print('Confusion matrix of test dataset: \n', confusion_matrix(y_test,y_predict))
ax=sns.heatmap(confusion_matrix(y_train,y_predict_train),cmap=plt.cm.Blues,annot=True)
ax.set_title('Confusion matrix of training dataset')
plt.show()
ax=sns.heatmap(confusion_matrix(y_test,y_predict),cmap=plt.cm.Blues,annot=True)
ax.set_title('Confusion matrix of test dataset')
plt.show()
The structure of the decision tree
from sklearn import tree
import graphviz
dot_tree=tree.export_graphviz(model,feature_names = dummy_X.columns, max_depth=2,
filled=True)
graph = graphviz.Source(dot_tree)
graph
But we still have another problem: the dataset is imbalanced. There are 27711 customers marked 0 and only 4291 marked 1, so we need to balance the dataset to improve the model's generalization. Here I use the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the minority class.
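At its core, SMOTE synthesizes new minority samples by interpolating between a minority sample and one of its minority-class nearest neighbours. A minimal sketch of that interpolation step, with hypothetical points:

```python
import numpy as np

rng = np.random.default_rng(0)
x_i  = np.array([1.0, 2.0])  # a hypothetical minority-class sample
x_nn = np.array([3.0, 4.0])  # one of its minority-class nearest neighbours

# New synthetic point at a random position on the segment between the two
gap = rng.random()           # uniform in [0, 1)
x_new = x_i + gap * (x_nn - x_i)
print(x_new)
```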
from imblearn.over_sampling import SMOTE
print('Before oversampling: \n', y.groupby(y).count())
X_balance,y_balance = SMOTE().fit_resample(dummy_X,y)
X_train, X_test, y_train, y_test = train_test_split(X_balance, y_balance, stratify=y_balance, test_size=0.3, random_state=100)
print('After oversampling: \n', y_balance.groupby(y_balance).count())
Now I use a random forest to model the data. A single decision tree is a weak classifier: it easily overfits or underfits the data. A random forest mitigates this by using a bagging strategy.
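To make the bagging idea concrete, here is a hand-rolled sketch on toy data: each tree is trained on a bootstrap sample and the ensemble takes a majority vote. (RandomForestClassifier additionally subsamples the candidate features at every split, which this sketch omits.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy data
rng = np.random.default_rng(0)

preds = []
for seed in range(25):
    # Bootstrap sample: draw n rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])
    preds.append(tree.predict(X))

# Majority vote across the 25 trees
vote = (np.mean(preds, axis=0) > 0.5).astype(int)
print('ensemble accuracy on X:', (vote == y).mean())
```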
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, max_depth=30, min_samples_leaf=5)
model.fit(X_train,y_train)
y_predict_train=model.predict(X_train)
y_predict=model.predict(X_test)
print('Accuracy of training dataset: ', accuracy_score(y_train, y_predict_train))
print('Accuracy of test dataset: ', accuracy_score(y_test, y_predict))
print('Confusion matrix of training dataset: \n', confusion_matrix(y_train,y_predict_train))
print('Confusion matrix of test dataset: \n', confusion_matrix(y_test,y_predict))
ax=sns.heatmap(confusion_matrix(y_train,y_predict_train),cmap=plt.cm.Blues,annot=True)
ax.set_title('Confusion matrix of training dataset')
plt.show()
ax=sns.heatmap(confusion_matrix(y_test,y_predict),cmap=plt.cm.Blues,annot=True)
ax.set_title('Confusion matrix of test dataset')
plt.show()
Empirical parameters might not be the best choice for a model, so I use grid search cross-validation to find the best combination of parameters. Here I chose 3 parameters to optimize: n_estimators, min_samples_leaf and max_depth. Every combination in the grid is evaluated with cross-validation, and the model with the highest score is returned.
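It is worth noting how large this search is: GridSearchCV evaluates the full Cartesian product of the parameter lists, each with 5-fold cross-validation.

```python
from itertools import product

n_estimators     = range(20, 101, 10)  # 9 values
min_samples_leaf = range(2, 20, 2)     # 9 values
max_depth        = range(10, 100, 5)   # 18 values

# Every combination in the grid is fitted cv=5 times
combos = len(list(product(n_estimators, min_samples_leaf, max_depth)))
print(combos, 'combinations,', combos * 5, 'model fits')  # → 1458 combinations, 7290 model fits
```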
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
param_test1 = {'n_estimators':range(20,101,10), 'min_samples_leaf':range(2,20,2), 'max_depth':range(10,100,5)}
gsearch1 = GridSearchCV(estimator=RandomForestClassifier(max_features='sqrt',
                                                         random_state=10),
                        param_grid=param_test1, scoring='roc_auc', cv=5, n_jobs=4)
gsearch1.fit(X_train,y_train)
gsearch1.best_estimator_ , gsearch1.best_params_, gsearch1.best_score_
y_predict_train=gsearch1.predict(X_train)
y_predict=gsearch1.predict(X_test)
print('Accuracy of training dataset: ', accuracy_score(y_train, y_predict_train))
print('Accuracy of test dataset: ', accuracy_score(y_test, y_predict))
print('Confusion matrix of training dataset: \n', confusion_matrix(y_train,y_predict_train))
print('Confusion matrix of test dataset: \n', confusion_matrix(y_test,y_predict))
Basically, I used Python + Jupyter notebook for all the data analysis. The final report is presented as a notebook.
During data exploration, I found some problems in the dataset.
So I used pandas to clean the data and re-encoded the whole dataset. All the fields after cleaning are categorical variables, which are not linear, so I chose non-linear machine learning methods: decision tree and random forest.
A simple decision tree can't fit the data well because the original dataset is extremely imbalanced, so I used the Synthetic Minority Over-sampling Technique (SMOTE) to oversample it. I also used a random forest, based on a bagging strategy, to avoid overfitting or underfitting.
Usually, empirical parameters are not the best choice for a model, so I used grid search cross-validation to find the best combination of parameters. I chose 3 parameters to optimize: n_estimators, min_samples_leaf and max_depth. Every combination in the grid is cross-validated, and the model with the highest score is kept.
Finally, the random forest reaches an accuracy of 0.88 on the training dataset and 0.85 on the test dataset, which is better than the decision tree.